The data under study consists in counts of reads for the Thyroid carcinoma as a Summarized Experiment object @ref(import). This dataset comes from the “Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results” project, with 20115 analyzed genes and 572 samples.
The column data contains pehnotypic data, which corresponds to clinical variables and their corresponding metadata @ref(coldata), it contains 549 columns with different information about the samples. The metadata consists of two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be use in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search.
The row data contains feature data @ref(rowdata). We can observe the characteristics of the studied genes such as their location, ranges, strand, symbol, length and CG content.
library(SummarizedExperiment)
thca <- readRDS("../seTHCA.rds")
thca
class: RangedSummarizedExperiment
dim: 20115 572
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(572): TCGA.4C.A93U.01A.11R.A39I.07
TCGA.BJ.A0YZ.01A.11R.A10U.07 ... TCGA.KS.A41J.11A.12R.A23N.07
TCGA.KS.A41L.11A.11R.A23N.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
dim(colData(thca))
[1] 572 549
colData(thca)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
type bcr_patient_uuid
<factor> <factor>
TCGA.4C.A93U.01A.11R.A39I.07 tumor E1C8AF06-9EA4-4062-9264-AA916E0025D1
TCGA.BJ.A0YZ.01A.11R.A10U.07 tumor f3ee888e-5116-4313-a2f6-3b1dcc3e4bc1
TCGA.BJ.A0Z0.01A.11R.A10U.07 tumor 72d8dcc3-0709-4fd1-98d4-fb75e9340758
TCGA.BJ.A0Z2.01A.11R.A10U.07 tumor f9ceffc0-d544-418d-b4a9-bd3c84e37026
TCGA.BJ.A0Z3.01A.11R.A13Y.07 tumor 331cae6e-2868-4c58-9302-709a9ff7d025
bcr_patient_barcode form_completion_date
<factor> <factor>
TCGA.4C.A93U.01A.11R.A39I.07 TCGA-4C-A93U 2014-11-12
TCGA.BJ.A0YZ.01A.11R.A10U.07 TCGA-BJ-A0YZ 2011-4-11
TCGA.BJ.A0Z0.01A.11R.A10U.07 TCGA-BJ-A0Z0 2011-4-15
TCGA.BJ.A0Z2.01A.11R.A10U.07 TCGA-BJ-A0Z2 2011-4-15
TCGA.BJ.A0Z3.01A.11R.A13Y.07 TCGA-BJ-A0Z3 2011-6-2
prospective_collection
<factor>
TCGA.4C.A93U.01A.11R.A39I.07 YES
TCGA.BJ.A0YZ.01A.11R.A10U.07 NO
TCGA.BJ.A0Z0.01A.11R.A10U.07 NO
TCGA.BJ.A0Z2.01A.11R.A10U.07 NO
TCGA.BJ.A0Z3.01A.11R.A13Y.07 NO
mcols(colData(thca), use.names=TRUE)
DataFrame with 549 rows and 2 columns
labelDescription
<character>
type sample type (tumor/normal)
bcr_patient_uuid bcr patient uuid
bcr_patient_barcode bcr patient barcode
form_completion_date form completion date
prospective_collection tissue prospective collection indicator
... ...
lymph_nodes_pelvic_pos_total total pelv lnp
lymph_nodes_aortic_examined_count total aor lnr
lymph_nodes_aortic_pos_by_he aln pos light micro
lymph_nodes_aortic_pos_by_ihc aln pos ihc
lymph_nodes_aortic_pos_total total aor-lnp
CDEID
<character>
type NA
bcr_patient_uuid NA
bcr_patient_barcode 2673794
form_completion_date NA
prospective_collection 3088492
... ...
lymph_nodes_pelvic_pos_total 3151828
lymph_nodes_aortic_examined_count 3104460
lymph_nodes_aortic_pos_by_he 3151832
lymph_nodes_aortic_pos_by_ihc 3151831
lymph_nodes_aortic_pos_total 3151827
metadata(thca)
$experimentData
Experiment data
Experimenter name: Piccolo SR
Laboratory: Dept. of Biology, Brigham Young University, Provo, UT (USA)
Contact information: stephen_piccolo@byu.edu
Title: Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results
URL: https://github.com/srp33/TCGA_RNASeq_Clinical
PMIDs: 26209429
No abstract available.
$annotation
[1] "org.Hs.eg.db"
$cancerTypeCode
[1] "THCA"
$cancerTypeDescription
[1] "Thyroid carcinoma"
$objectCreationDate
[1] "2016-04-25"
rowData(thca)
DataFrame with 20115 rows and 3 columns
symbol txlen txgc
<character> <integer> <numeric>
1 A1BG 3322 0.5644190
2 A2M 4844 0.4882329
3 NAT1 2280 0.3942982
4 NAT2 1322 0.3895613
5 SERPINA3 3067 0.5249429
... ... ... ...
20111 POTEB 1706 0.4308324
20112 SNORD124 104 0.4903846
20113 SNORD121B 81 0.4074074
20114 GAGE10 538 0.5055762
20115 BRWD1-IT2 1028 0.5924125
rowRanges(thca)
GRanges object with 20115 ranges and 3 metadata columns:
seqnames ranges strand | symbol txlen
<Rle> <IRanges> <Rle> | <character> <integer>
1 chr19 [58345178, 58362751] - | A1BG 3322
2 chr12 [ 9067664, 9116229] - | A2M 4844
9 chr8 [18170477, 18223689] + | NAT1 2280
10 chr8 [18391245, 18401218] + | NAT2 1322
12 chr14 [94592058, 94624646] + | SERPINA3 3067
... ... ... ... . ... ...
100996331 chr15 [20835372, 21877298] - | POTEB 1706
101340251 chr17 [40027542, 40027645] - | SNORD124 104
101340252 chr9 [33934296, 33934376] - | SNORD121B 81
102724473 chrX [49303669, 49319844] + | GAGE10 538
103091865 chr21 [39313935, 39314962] + | BRWD1-IT2 1028
txgc
<numeric>
1 0.5644190
2 0.4882329
9 0.3942982
10 0.3895613
12 0.5249429
... ...
100996331 0.4308324
101340251 0.4903846
101340252 0.4074074
102724473 0.5055762
103091865 0.5924125
-------
seqinfo: 455 sequences (1 circular) from hg38 genome
In this supplementary material we will perform a quality assessment and normalization of the data, in order to do that, we will create a ‘DGEList’ object. In order to ease the manipulation of the data \(\log_2\) CPM values of expression are calculated @ref(logCPM).
library(edgeR)
dge <- DGEList(counts=assays(thca)$counts, genes=mcols(thca))
Warning in as.data.frame(x, row.names = NULL, optional = optional, ...):
Arguments in '...' ignored
saveRDS(dge, file.path("./", "dge.rds"))
assays(thca)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(thca)$logCPM[1:5, 1:5]
TCGA.4C.A93U.01A.11R.A39I.07 TCGA.BJ.A0YZ.01A.11R.A10U.07
1 2.408053 3.875800
2 9.483176 10.203685
9 -6.667872 -6.667872
10 -6.667872 -6.667872
12 3.688878 4.988388
TCGA.BJ.A0Z0.01A.11R.A10U.07 TCGA.BJ.A0Z2.01A.11R.A10U.07
1 2.778138 4.239719
2 8.890882 8.403990
9 -6.667872 -6.667872
10 -6.667872 -6.667872
12 4.804717 4.839157
TCGA.BJ.A0Z3.01A.11R.A13Y.07
1 2.655578
2 9.190160
9 -6.667872
10 -6.667872
12 4.605054
In thyroid cancer we consider some variables more important than others as, according to bibliography, significant differences are seen in the categories of this variables related to thyroid cancer. We want to have a first look to this categories before the analysis of the whole data. This will give us an idea of the distribution of our data and can be useful in further analysis @ref(numvars).
par(mfrow=c(3, 1))
#Type
tumor <- thca[ , colData(thca)$type == "tumor"]
normal <- thca[ , colData(thca)$type == "normal"]
dim(tumor)
[1] 20115 513
dim(normal)
[1] 20115 59
barplot(summary(colData(thca)$type), main="Type of sample", xlab="Sample type", ylab="Total counts")
#Gender
male <- thca[ , !is.na(thca$gender) & colData(thca)$gender == "MALE"]
female <- thca[ , !is.na(thca$gender) & colData(thca)$gender == "FEMALE"]
dim(male)
[1] 20115 149
dim(female)
[1] 20115 406
na <- thca[ , is.na(thca$gender)]
dim(na)
[1] 20115 17
barplot(summary(colData(thca)$gender), main="Patient's gender", xlab="Gender", ylab="Total counts")
#Age
barplot(summary(colData(thca)$age_at_diagnosis),main="Patient's age", xlab="Age", ylab="Total counts")
The last “age” column correspond to NA, but we do not remove them in this plot because we may need those samples in further analysis. A we can see, we have 513 tumors and 59 normal samples; or, lookig at the gender, 149 males and 406 females, with 17 samples from unknown gender.
In this part, we examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure @ref(libsizes) below shows the sequencing depth per sample, also known as library sizes, in increasing order.
Fig1.Library sizes in increasing order.
In this plot we can visually observe again that we have more samples from tumor than normal, both types have similar sequencing depth. However, the fugure also reveals substantial differences in sequencing depth between samples, this is why the discardation of those samples whose depth is substantially lower than the rest needs to be taked into consideration.
Sample Depth
In order to remove those samples with a lower sequencing depth value, we first calculate the sample depth @ref(smpdepth), and then select a cutoff to remove samples.
sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(thca), 6, 12)
sort(sampledepth)
DJ.A3VJ ET.A3BP ET.A3DO EL.A4K7 EL.A4K2 EL.A3CP EM.A22Q E8.A417 DJ.A13W
0.5 2.0 2.2 2.6 3.2 14.4 15.6 16.9 18.0
EM.A22O KS.A41I EL.A3ZS ET.A25M ET.A25L BJ.A28T EM.A3FQ FK.A4UB FE.A3PD
19.2 20.5 20.7 20.9 21.7 22.3 22.9 23.0 23.1
IM.A41Z DJ.A13R DJ.A1QG KS.A41F EM.A2CK BJ.A45E EM.A1YB ET.A4KN EM.A1CT
23.3 23.7 23.7 23.7 24.1 24.9 26.4 26.7 27.4
EL.A4JX EL.A3ZL BJ.A0ZE KS.A4I7 BJ.A28W DJ.A2PY DJ.A4UR FY.A3R6 DE.A7U5
27.5 28.5 29.6 29.6 30.4 30.4 30.5 30.6 31.0
EM.A1YE EL.A3ZG DJ.A2PQ BJ.A28W EM.A3O3 EM.A22M EM.A2CM DE.A4MA EL.A3H5
31.2 31.6 32.7 32.7 32.8 32.9 33.5 33.7 33.7
BJ.A18Z QD.A8IV EM.A1CW KS.A4I1 FY.A3TY EL.A3ZQ MK.A84Z L6.A4EP DE.A4M8
34.0 34.1 34.3 34.4 34.8 34.9 34.9 35.3 35.4
ET.A25R GE.A2C6 ET.A3DR FE.A230 DJ.A2QA EM.A4G1 FY.A76V KS.A4ID EM.A4FM
35.6 35.6 35.7 35.8 36.0 36.2 36.2 36.2 36.6
BJ.A0Z9 EM.A3SZ ET.A25K EL.A3CW DE.A4MB DJ.A2Q0 FE.A22Z FE.A3PC KS.A4IB
36.9 36.9 36.9 37.0 37.1 37.4 37.5 37.7 37.7
E8.A437 DJ.A4UT EL.A4K1 EM.A3ST E8.A2JQ BJ.A28S E8.A3X7 EM.A3O9 KS.A4I3
38.0 38.1 38.2 38.5 38.6 38.7 38.9 39.0 39.3
BJ.A191 EM.A1YC BJ.A0Z2 BJ.A45F EM.A3SU ET.A40S E8.A44K EL.A3D5 DJ.A4V4
39.4 39.4 39.6 39.6 39.6 39.6 39.7 39.9 40.0
FY.A3YR EM.A2CO EM.A4FV ET.A39M BJ.A2NA DJ.A2PZ BJ.A0Z5 BJ.A45H EM.A2OY
40.4 40.5 40.6 40.7 40.7 40.8 40.9 41.2 41.2
ET.A3DU KS.A4I9 DJ.A3VL L6.A4EQ EL.A4K9 BJ.A2P4 EM.A3FN IM.A4EB FY.A3TY
41.2 41.2 41.3 41.5 41.6 41.7 41.7 41.8 41.8
4C.A93U DJ.A13V E8.A433 EM.A3ST BJ.A3PT EM.A3SY KS.A41I BJ.A0ZJ ET.A3DQ
41.9 42.0 42.0 42.2 42.3 42.4 42.4 42.6 42.6
DE.A4MD ET.A39R J8.A4HW KS.A4I5 EM.A2P1 KS.A4IC IM.A41Y KS.A41L FY.A3BL
42.7 42.8 42.9 42.9 43.0 43.0 43.1 43.1 43.2
FY.A4B0 EL.A3D4 MK.A4N9 EL.A3CM DJ.A1QO J8.A4HY DJ.A3V0 EM.A2CP BJ.A28Z
43.3 43.6 43.6 43.8 44.0 44.0 44.1 44.1 44.2
DE.A69K DJ.A3V6 EM.A2P1 BJ.A4O8 DE.A4M9 EL.A3N3 EL.A3ZN EM.A3AO BJ.A0YZ
44.4 44.4 44.4 44.5 44.5 44.6 44.6 44.6 44.7
E8.A44M BJ.A3PR DE.A0XZ BJ.A3PR CE.A13K DJ.A3UW EL.A3GO EL.A3ZT EL.A4KI
44.7 44.8 44.8 44.8 44.9 44.9 44.9 44.9 44.9
EM.A3AJ EM.A2OW EM.A1CU DJ.A2QC EL.A3CU BJ.A0ZC CE.A3MD DJ.A4UL BJ.A45I
44.9 45.0 45.0 45.1 45.1 45.2 45.2 45.2 45.3
EL.A3CX E8.A2JQ EL.A3ZM DJ.A13M E3.A3E0 ET.A2N3 E8.A438 EL.A3CL J8.A3YH
45.3 45.3 45.4 45.5 45.5 45.7 45.9 45.9 45.9
EM.A4FR EL.A3GZ DJ.A2PS EL.A3GR EM.A4FN EM.A3FP ET.A39T DJ.A3VM EL.A4K6
46.0 46.0 46.3 46.4 46.4 46.5 46.5 46.6 46.6
EM.A2OV E8.A2EA EM.A1YD ET.A2N0 FY.A40N BJ.A291 EL.A4JV ET.A2N5 BJ.A45C
46.6 46.7 46.7 46.7 46.7 46.8 46.8 46.8 46.9
ET.A3BW E3.A3E5 BJ.A28R EL.A3H8 EL.A3GP EL.A3T1 DO.A1K0 EL.A3H3 EM.A4FO
46.9 47.0 47.1 47.2 47.3 47.3 47.4 47.4 47.4
FY.A3NN J8.A3YD EL.A3ZP CE.A485 EL.A3T3 DJ.A3VK ET.A2MX BJ.A0ZG BJ.A3PU
47.4 47.4 47.5 47.7 47.7 47.8 47.8 47.9 47.9
EL.A3ZP ET.A3DT EL.A3ZK ET.A40P BJ.A0Z0 DE.A0Y3 EM.A2CR ET.A39N CE.A3ME
47.9 47.9 48.0 48.1 48.2 48.2 48.2 48.2 48.3
DJ.A4UW EL.A3TA EL.A3ZS EM.A3AQ ET.A2MY EM.A1YC L6.A4ET EM.A2OX EM.A3AR
48.3 48.3 48.3 48.3 48.3 48.4 48.4 48.5 48.5
DJ.A3VE EL.A3GU CE.A484 DJ.A2PT DO.A1JZ EL.A3ZK ET.A2MX E3.A3E2 EL.A3T8
48.6 48.6 48.7 48.7 48.7 48.9 48.9 49.0 49.0
FY.A2QD DJ.A4UQ FY.A40K EL.A3ZO DJ.A4UP DJ.A4V0 E8.A419 EL.A3D0 EL.A3D6
49.0 49.1 49.1 49.1 49.2 49.2 49.2 49.2 49.2
EL.A3D1 EL.A3GQ J8.A42S ET.A39O J8.A4HW BJ.A192 DJ.A2QB DJ.A3VD E3.A3DZ
49.3 49.3 49.3 49.4 49.4 49.5 49.6 49.6 49.6
DJ.A2Q1 DJ.A3UU E8.A413 EL.A3T0 EL.A3CN EL.A3CS ET.A40T DE.A4MD EM.A2CN
49.7 49.7 49.7 49.7 49.8 49.8 49.8 49.9 49.9
DO.A1JZ BJ.A28X EL.A4JZ EM.A1CV BJ.A0ZA DJ.A1QL DJ.A2PO EM.A22I CE.A481
50.0 50.1 50.1 50.1 50.2 50.3 50.3 50.3 50.4
EL.A3MX EL.A4K0 EL.A4JW ET.A39K ET.A3BV EL.A4KH ET.A2N4 EM.A1CS BJ.A2N9
50.4 50.4 50.5 50.5 50.5 50.6 50.6 50.6 50.7
BJ.A18Y DO.A2HM EM.A1CU BJ.A28X BJ.A45G DJ.A4V5 EM.A2P3 ET.A39P EM.A2P0
50.8 50.8 50.8 50.9 51.0 51.0 51.0 51.0 51.1
J8.A3O2 ET.A3DW DJ.A3V3 EL.A3CO EM.A3SX BJ.A2N9 ET.A25N DJ.A4V2 EM.A2CJ
51.1 51.1 51.2 51.2 51.2 51.2 51.3 51.4 51.4
BJ.A3F0 EM.A3AI FY.A3I5 DJ.A13U EL.A3CZ EL.A3GZ FE.A232 EL.A3T8 EL.A3ZR
51.5 51.5 51.5 51.6 51.7 51.8 51.8 51.8 51.9
J8.A3YF EL.A3ZT DJ.A3UZ EM.A22P DE.A0Y2 EL.A3H2 EM.A4FK ET.A39S BJ.A45K
51.9 51.9 52.1 52.3 52.4 52.4 52.4 52.4 52.5
DJ.A13L EL.A3CY FY.A3RA DJ.A3UK H2.A422 J8.A3O2 E3.A3E3 EL.A3T0 ET.A3BO
52.5 52.6 52.6 52.7 52.7 52.7 52.9 53.0 53.0
J8.A3YE ET.A39J ET.A3DW MK.A4N6 DJ.A1QQ EM.A3FO ET.A25G FE.A3PA FY.A40M
53.0 53.1 53.1 53.1 53.2 53.2 53.2 53.2 53.2
EL.A3GV EL.A3MW EL.A3H2 DJ.A3UR EM.A2CS FE.A235 EL.A3MW ET.A3BX BJ.A0ZH
53.3 53.3 53.4 53.6 53.6 53.6 53.6 53.7 53.8
DJ.A3UN EL.A3MY EM.A3FQ FY.A4B4 IM.A420 DE.A3KN DJ.A2PW EL.A3T2 EL.A3CT
53.8 53.9 53.9 54.0 54.0 54.1 54.1 54.3 54.4
EL.A3T6 EL.A3CR EM.A22J FE.A237 BJ.A45D BJ.A45J EM.A3FR FY.A40L ET.A3BN
54.4 54.5 54.5 54.5 54.6 54.6 54.6 54.6 54.7
CE.A482 DJ.A3VI EM.A1YA EM.A2CS BJ.A190 EL.A3TB EL.A3T9 EM.A3FL KS.A41L
54.8 54.8 54.8 54.8 54.9 54.9 55.0 55.0 55.0
EL.A4KD FY.A3ON J8.A3O0 DJ.A3VG EL.A3GX EM.A22K EM.A3O7 E8.A432 E8.A436
55.1 55.1 55.1 55.2 55.2 55.2 55.2 55.3 55.3
J8.A3YH EL.A3H7 EL.A3ZR BJ.A290 DJ.A3UP EL.A3GS DJ.A2PR DJ.A2PV EL.A3N2
55.3 55.3 55.4 55.5 55.5 55.5 55.6 55.6 55.6
KS.A41J DJ.A3UO EL.A3H4 DE.A4MC DJ.A3US FK.A3S3 DJ.A3UQ E8.A434 FE.A3PB
55.7 55.8 55.8 55.9 55.9 55.9 56.0 56.1 56.1
FY.A4B3 DJ.A2PN H2.A421 IM.A3U2 EM.A1CW EM.A2CU DJ.A2Q9 EL.A3ZH EM.A4FQ
56.1 56.2 56.2 56.2 56.3 56.3 56.5 56.5 56.5
EM.A1CT EL.A3CV ET.A3BS DJ.A3UY ET.A25J CE.A483 EL.A3MZ FE.A239 EL.A3MY
56.6 56.7 56.7 56.8 56.8 56.9 56.9 56.9 56.9
DJ.A3UX E8.A416 EL.A3T7 ET.A25O ET.A3BU EM.A3FK ET.A3DS GE.A2C6 DJ.A2PX
57.0 57.0 57.0 57.2 57.2 57.3 57.3 57.3 57.4
DJ.A1QE DJ.A3UT EL.A3H1 EM.A3FM EL.A3ZL DJ.A1QI EL.A4K4 BJ.A2NA DJ.A13S
57.5 57.5 57.5 57.7 57.7 57.8 57.9 58.0 58.3
EM.A3OB BJ.A2N8 FY.A3R9 EL.A3T3 EL.A3GW IM.A3ED ET.A40R EM.A2P2 DJ.A13P
58.3 58.5 58.5 58.5 58.6 58.6 58.7 58.8 58.9
DJ.A13X DJ.A3V5 BJ.A0ZB DJ.A2Q2 EL.A3ZG DJ.A3VA EL.A3N3 FY.A3R8 ET.A2N5
58.9 58.9 59.0 59.0 59.0 59.1 59.1 59.2 59.2
DJ.A2PU FK.A3SH DJ.A3UV EM.A3SU EM.A4FU H2.A3RH J8.A3NZ DJ.A3V4 EL.A3TB
59.3 59.3 59.4 59.4 59.4 59.4 59.4 59.7 59.7
DJ.A3V7 DJ.A3V8 EL.A3ZM J8.A3O1 ET.A39L H2.A3RI EM.A4FH DJ.A2Q5 ET.A25P
59.8 59.8 59.8 59.8 59.9 60.1 60.2 60.3 60.4
BJ.A4O9 EM.A3FJ EL.A3T2 E3.A3DY EL.A3T6 EL.A3ZH E3.A3E1 EM.A1CS ET.A3DV
60.5 60.5 60.8 60.9 61.0 61.0 61.1 61.1 61.1
E8.A414 ET.A39I FY.A3R7 H2.A26U IM.A3EB DJ.A13T DJ.A2PP EM.A4FF MK.A4N7
61.2 61.3 61.3 61.3 61.4 61.6 61.6 61.7 61.8
EL.A3TA IM.A3U3 EL.A3ZO FK.A3SG DJ.A3V2 ET.A3BT EL.A3H7 EM.A22N FK.A3SD
61.8 62.0 62.1 62.3 62.5 62.6 62.8 62.8 62.9
BJ.A3EZ BJ.A28V FK.A3SB H2.A2K9 DJ.A3VF FE.A238 EM.A3AK BJ.A3PU BJ.A0ZF
63.0 63.1 63.1 63.2 63.3 63.3 63.5 63.5 63.6
EL.A3N2 EL.A4KG EL.A3H1 EM.A3AL EM.A3AN FY.A3NP DJ.A3V9 ET.A3DP EM.A2CT
63.7 63.8 63.8 64.1 64.2 64.7 64.9 65.0 65.3
FE.A23A FE.A236 BJ.A2N7 ET.A3DP E8.A415 EM.A2CL FE.A234 KS.A41J BJ.A2N7
65.3 65.4 65.4 65.5 65.8 65.8 65.8 65.9 66.2
FY.A3WA EM.A2CQ H2.A3RI EL.A3ZQ ET.A2MZ J8.A3YG EL.A3T1 ET.A25I E8.A418
66.3 66.4 66.4 66.5 67.8 68.1 68.2 68.9 69.0
EM.A3O6 EM.A1CV FY.A3W9 L6.A4EU BJ.A290 ET.A3BQ DJ.A2Q3 DJ.A3UM BJ.A0Z3
69.1 69.2 69.4 69.4 69.4 69.7 69.8 69.9 70.4
EM.A3O8 DJ.A1QH EL.A3MX ET.A40Q FY.A3I4 FK.A3SE ET.A4KQ FE.A231 DE.A69J
70.8 71.4 72.0 72.3 72.3 72.4 72.6 72.8 73.0
DJ.A1QD BJ.A28R CE.A27D EM.A3AP DJ.A2Q7 EL.A3GY E8.A242 DE.A2OL DJ.A2Q6
73.4 73.5 73.5 73.5 73.6 73.6 74.6 75.0 75.5
DJ.A1QM DJ.A2Q4 DJ.A3VB EL.A3T7 EM.A22L FY.A3NM DJ.A13O EM.A2OZ DJ.A1QN
75.8 76.6 76.6 77.2 78.9 79.1 82.9 84.0 85.7
DJ.A1QF BJ.A2N8 H2.A2K9 EM.A3OA FE.A233
86.2 89.1 91.1 95.4 104.0
Moreover, we know that this cancer has different affection depending on the gender, being women more susceptibles, so maybe we would like to compare samples regarding the gender. Samples with unknown (NA) gender will be removed in order to allow further analysis.
In conclusion, two filters are applied @ref(maskdg):
maskd <- sampledepth > 40 #sample depth filter
maskg <- !is.na(thca$gender) #gender filter
dim(thca)
[1] 20115 572
thca.filt <- thca[,maskd&maskg]
dge.filt <- dge[,maskd&maskg]
dim(thca.filt)
[1] 20115 467
Once the filters are applied, the dataset contains 20115 genes and 467 samples. This way, none of our samples has a low sample depth @ref(filtlibsizes).
Fig1.Library sizes in increasing order.
One way to normalize RNA-seq data is an adjustment to compare across features in a sample, this can be performed using count per million reads (CPM). The distribution of expression values per samle in terms of logarithmic CPM units is ploted @ref(distRawExp) separating by tumor and normal samples due to the large number of samples. A box plot of the expression values per samples is also performed @ref(distboxplot) in order to have another visual way to interpret the data and spot location differences.
Fig2. Non-parametric density distribution of expression profiles per sample.
Fig3. Box plot of the distribution expression values across samples
From this plots, it is observes that all samples follow a similar pattern with a group of highly expressed genes and another group of lowly expressed genes, following the common pattern of expression. From the box plot we can observe that there are no samples that deviate from the average interquartile range. For all this reasons, we assume that we do not need to normalise among samples.
In order to identify lowly expressed genes, the average expression per gene through all samples is calculated. The distribution of thos values across genes is represented @ref(exprdist).
Fig4. Distribution of average expression level per gene.
The most part of the genes have at least a log2CPM greater than 0, and we want to remove those one lowly expressed.
We can fiter lowly-expreesed genes following two criteria:
Filter out genes below a minimum average log2CPM throught the samples.
Filter out genes with fewer than a given number of sample meeting a minimum log2CPM.
First, we are going to filter out genes with the second approach, and visualize it with an histogram to know if we want also to perform the other filter.
#calculate cpm cutoff around all samples
cpmcutoff <- round(10/min(dge.filt$sample$lib.size/1e+06), digits = 1)
cpmcutoff
[1] 0.2
#select number of samples meeting that cutoff
nsamplescutoff <- min(table(thca.filt$type))
nsamplescutoff
[1] 53
mask <- rowSums(cpm(dge.filt) > cpmcutoff) >= nsamplescutoff
thca.filt2 <- thca.filt[mask, ]
dge.filt2 <- dge.filt[mask, ]
dim(thca.filt2)
[1] 15408 467
Fig5. Distribution of average expression level per gene with first filter.
Now, we have a dataset of 15408 genes to analyzed.
In this histogram, we have all genes that have passed the cutoff (red bars). However, we observe that the first red columns are not very representative of the group, and as we have a lot of genes to perform the analysis, we are going toapply a second filter.
To remove them, we choose a cutoff of 3 log2 CPM unit as minimum value of expression and visualize the fitered dataset with another histogram.
mask <- avgexp > 3
dim(thca.filt2)
[1] 15408 467
thca.filt3 <- thca.filt[mask, ]
dim(thca.filt3)
[1] 9423 467
dge.filt3 <- dge.filt[mask, ]
dim(dge.filt3)
[1] 9423 467
Fig6. Distribution of average expression level per gene with second filter.
After filtering the dataset with both methods, we have finally 9423 genes to perform a differential expression analysis.
Store un-normalized versions of the filtered expression data.
saveRDS(thca.filt3, file.path("./", "thca.filt.unnorm.rds"))
saveRDS(dge.filt3, file.path("./", "dge.filt.unnorm.rds"))
We calculate now the normalization factors on the filtered expression data set.
dge.filt3 <- calcNormFactors(dge.filt3)
head(dge.filt3$samples$norm.factors)
[1] 0.8471913 0.9717390 0.9386724 1.0353601 1.0518953 0.9920039
Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.
assays(thca.filt3)$logCPM <- cpm(dge.filt3, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)
Store normalized versions of the filtered expression data.
saveRDS(thca.filt3, file.path("./", "thca.filt.rds"))
saveRDS(dge.filt3, file.path("./", "dge.filt.rds"))
MA plots are used for detecting intensity dependent biases, by comparing two groups of the dataset,
We first examine a global MA-plot for normalized and non-normalized data to have a general idea of the possible dependencies.
< table of extent 0 >
Fig7. MA-plots of normalized and non-normalized data.
Fig7. MA-plots of normalized and non-normalized data.
In the non-normalized plot, we observe there are a lot of genes non differential expreesion (logFC ~ 0) and also several ones with a a low log2CPM which are removed in the normalized plot to avoid artifacts in the posterior analysis. In the normalized plot, we also observe a smother shape with less outliers.
We examine now the MA-plots of the normalized expression profiles for each tumor and normal samples sparately, in order to observe if there is any sample with a anomalous expression profile. We look first to the tumor samples in Figure @ref(fig:maPlotsTumor).
Fig8. MA-plots of the tumor samples.
Fig8. MA-plots of the tumor samples.
Fig8. MA-plots of the tumor samples.
Fig8. MA-plots of the tumor samples.
Fig8. MA-plots of the tumor samples.